Netspeak - Assisting Writers in Choosing Words

Authors

  • Martin Potthast
  • Martin Trenkmann
  • Benno Stein
Abstract

NETSPEAK is a Web service that helps writers find alternative expressions for what they want to say. It provides a large index of writing samples in the form of n-grams, n ≤ 5, along with an efficient means of retrieving them through wildcard queries. When in doubt about a phrasing, a user can gather additional evidence by retrieving samples that match a given context. The figure below shows the results for a query where a user is interested in the two most frequently written words between “looks” and “me”. The first two columns give an idea of how customary each result is, and the user can select the one most appropriate for her sentence.

To provide a rich choice of writing samples, we index the Google n-gram corpus, which was compiled from a large portion of the English Web and consists of more than 3 billion n-grams along with their occurrence frequencies [2]. We have developed a space-optimal inverted index based on minimal perfect hashing. The hash function maps the vocabulary V of the corpus to the storage positions of postlists. A hash function is perfect if it produces no hash collisions for the key set V, and it is minimal if the number of storage positions required does not exceed |V|. The hash function is constructed with the CHD algorithm, which incurs a space overhead of 2.07 × |V| bits [1]. Moreover, the index provides a top-k retrieval strategy to find the n-grams matching a query; details can be found in [3].

The table below shows selected performance data of our index. NETSPEAK is currently deployed on a cluster of 15 computers. In a load test, the service was measured to process about 10,000 queries per second.
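The two ideas in the abstract, wildcard retrieval over n-grams and the definition of a minimal perfect hash function, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `?` token as a one-word wildcard, the toy n-gram table with invented frequencies, and the helper names `matches`, `top_k`, and `is_minimal_perfect`. The sketch scans the table linearly; it does not reproduce the inverted index, the CHD construction, or the top-k strategy of [3].

```python
from heapq import nlargest

# Toy n-gram table; the frequencies are invented for illustration
# and are not taken from the Google n-gram corpus.
NGRAMS = {
    ("looks", "good", "to", "me"): 120,
    ("looks", "fine", "to", "me"): 80,
    ("looks", "like", "to", "me"): 5,
    ("looks", "at", "me"): 300,
    ("looks", "good", "to", "you"): 40,
}


def matches(pattern, ngram):
    """True if ngram has the pattern's length and agrees on every
    token that is not the (assumed) one-word wildcard '?'."""
    return len(pattern) == len(ngram) and all(
        p == "?" or p == w for p, w in zip(pattern, ngram)
    )


def top_k(query, k, table=NGRAMS):
    """Return the k most frequent n-grams matching the query.
    Linear scan over the table, purely for illustration."""
    pattern = tuple(query.split())
    hits = (
        (freq, ngram) for ngram, freq in table.items()
        if matches(pattern, ngram)
    )
    return [(" ".join(ngram), freq) for freq, ngram in nlargest(k, hits)]


def is_minimal_perfect(h, keys):
    """Check the two properties from the abstract on a key set V:
    perfect = no two keys share a slot; minimal = all slots lie
    in range(|V|), so no more than |V| positions are needed."""
    keys = set(keys)
    slots = [h(key) for key in keys]
    perfect = len(set(slots)) == len(keys)
    minimal = all(0 <= s < len(keys) for s in slots)
    return perfect and minimal


if __name__ == "__main__":
    # Two wildcard positions between "looks" and "me", top two results.
    print(top_k("looks ? ? me", 2))

    vocabulary = ["looks", "good", "to", "me"]
    lookup = {word: i for i, word in enumerate(vocabulary)}
    print(is_minimal_perfect(lookup.__getitem__, vocabulary))  # slots 0..3
    print(is_minimal_perfect(len, vocabulary))  # "to" and "me" collide
```

The dictionary-backed `lookup` only demonstrates the two properties; a real minimal perfect hash function avoids storing the keys themselves, which is where the per-key overhead in bits quoted above comes from.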

Similar resources

Choosing words in computer-generated weather forecasts

One of the main challenges in automatically generating textual weather forecasts is choosing appropriate English words to communicate numeric weather data. A corpus-based analysis of how humans write forecasts showed that there were major differences in how individual writers performed this task, that is, in how they translated data into words. These differences included both different preferen...

A Web-based Application for Writing Novels

In this paper, we propose a method for assisting amateur writers in novel writing. Amateur writers can publish their work intensively through web infrastructures. This situation is beneficial, because it encourages amateur writers to enhance their skills by sharing their work. However, writing a good novel is difficult for a novice, because the novel-writing task requires the management of many...

Retrieving Customary Web Language to Assist Writers

This paper introduces NETSPEAK, a Web service which assists writers in finding adequate expressions. To provide statistically relevant suggestions, the service indexes more than 1.8 billion n-grams, n ≤ 5, along with their occurrence frequencies on the Web. If in doubt about a wording, a user can specify a query that has wildcards inserted at those positions where she feels uncertain. Queries d...

Proceedings of the Seventh Web as Corpus Workshop (WAC 7)

We will discuss backgrounds, technology, and applications developed in the Webis Research Group, whereas the talk’s common thread is the exploitation of the web as a corpus. Three different applications will reveal different rationales and possibilities when operationalizing text reuse and language reuse on a large scale. 1. The Netspeak word search engine reuses the web as a corpus of writing ...

Discourse Community Collocations and L2 Writing Content

Taking the position that writing can be an important skill to foster knowledge building pedagogy, this article explores vocabulary as a supportive tool for this purpose. Having this in mind, a compilation of conceptually loaded vocabularies pertaining to seven discourse communities was developed, two of which were given to a group of L2 writers to investigate the implications of phraseology for...

Publication date: 2010